<!DOCTYPE html>
<html>
    <head>
        <meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,minimum-scale=1.0,user-scalable=no">
        <title>ProteinKG25</title>
        <style type="text/css">
            #none{
                text-decoration: none;
                color: blue;
            }
            #overline{
                text-decoration: overline;
            }
            #line-through{
                text-decoration: line-through;
            }
            .footer{
                font-family: Times;
                font-size: 20px;
                line-height: 1.5;
                word-wrap: break-word;
                box-sizing: border-box;
                border-color: #eaecef;
                border-top: 1px solid #595959;
                color: #586069;
                margin-top: 32px;
                margin-bottom: 20px;
                padding-top: 16px;
                padding-bottom: 40px;
                text-align: right;
                margin-bottom: 0;
            }
            .d1{
                width: 1000px;
                height: 1000px;
                line-height: 40px;
                margin-left: 15%;
                margin-right: 15%;
                font-size: 18px;
                margin: 0 auto;
            }
            .set_table{
                line-height: 30px;
                font-style: italic;
                font-family: Times;
            }
            .add_border_line{
                border-bottom: 1px solid #595959;
                border-top: 1px solid #595959;
            }
            .add_top_border_line{
                border-top: 1px solid #595959;
            }
        </style>
    </head>
    <body>
        <div class="d1">
            <h1>ProteinKG25</h1>
            <hr />
            <img src="../figures/ProteinKG25.svg" width="1000" alt="ProteinKG25">


            <h2>Introduction</h2>
            <hr />
            <p>ProteinKG25 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with textual descriptions and protein entities are aligned with protein sequences. It contains 612,483 entities and 4,990,097 triplets (4,879,951 Protein-GO triplets and 110,146 GO-GO triplets). We provide both the inductive and the transductive settings used in the <a href="https://arxiv.org/pdf/2201.11147.pdf" id="none">original paper</a>.
            </p>
            <div style="width: 100%; background-color: white;">
                <table width="900" border="0" cellpadding="7" cellspacing="0" bgcolor="#FFFFFF" style="display: inline-block;" class="set_table">
                  <tr>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">Setting</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;Split</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#protein entities</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#GO entities</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#relations</div></td>
                    <td bgcolor="#F6F6F6" class="add_border_line"><div align="center">&emsp;#triplets</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center"></div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;train</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;541,916</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;28,610</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;52</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;4,861,576</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6"><div align="center">Transductive</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;valid</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;7,834</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;1,865</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;44</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;12,354</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center"></div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;test</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;381,428</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;10,611</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;54</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;1,267,362</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center"></div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;train</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;541,916</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;28,610</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;52</div></td>
                    <td bgcolor="#F6F6F6" class="add_top_border_line"><div align="center">&emsp;&emsp;4,861,576</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#FFFFFF"><div align="center">Inductive</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;valid</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;608</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;118</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;12</div></td>
                    <td bgcolor="#FFFFFF"><div align="center">&emsp;&emsp;1,170</div></td>
                  </tr>
                  <tr>
                    <td bgcolor="#F6F6F6"><div align="center"></div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;test</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;1,553</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;541</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;37</div></td>
                    <td bgcolor="#F6F6F6"><div align="center">&emsp;&emsp;4,405</div></td>
                  </tr>
                </table>
                
            </div>
            <p>GO term information comes from <a href="http://geneontology.org/" id='none'>Gene Ontology</a>. Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) and the relations that operate between them. The GO vocabulary of genes and gene products is divided into three categories, covering three aspects of biology:</p>
            <ul style="background-color: #FBEFF2; height: 126px; padding-top: 5px; border-radius: 15px; padding-left: 100px;">
                <li>Molecular Function</li>
                <li>Cellular Component</li>
                <li>Biological Process</li>
            </ul>
            
            <h2>GO term</h2>
            <hr />
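            <p>Every GO term belongs to one of three sub-ontologies: Biological Process (BPO), Molecular Function (MFO), or Cellular Component (CCO). The relations between proteins and GO terms are mainly concentrated in the following three types (triplet counts in parentheses):</p>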
            <ul style="background-color: #FBEFF2; height: 126px; padding-top: 5px; border-radius: 15px; padding-left: 100px;">
                <li>part of &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp;&emsp;&emsp;&emsp;&emsp;(1,110,624)</li>
                <li>enables &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;(1,944,197)</li>
                <li>involved in &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&ensp;&emsp;&emsp;&emsp;&emsp;(1,720,433)</li>
            </ul>

            <h2>Data</h2>
            <hr />
            <p>
                <a href="https://drive.google.com/file/d/1iTC2-zbvYZCDhWM_wxRufCvV6vvPk8HR/view">Download ProteinKG25</a>
            </p>


            <h2>Details</h2>
            <hr />
            <p>ProteinKG25 is built from the Gene Ontology and Gene Annotations released in April 2020. Each entity (GO term or protein) is identified by a unique ID.</p>
            <p>
                Protein-GO triplet example:
                <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">71090(Q14028) &emsp;&emsp;&emsp;&emsp;&emsp; 36(enables_nucleotide_binding) &emsp;&emsp;&emsp;&emsp;&emsp;  117(GO:0000166)</div>
                GO-GO triplet example:
                <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">0(GO:0000001) &emsp;&emsp;&emsp;&emsp;&emsp; 0(is_a) &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;  23558(GO:0048308)</div>
                Protein sequence example:
                <div style="background-color: #FBEFF2; height: 40px; padding-top: 5px; border-radius: 15px; padding-left: 100px; margin: 20px;">P0DQM8: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;QKCCSGGSCPLYFRDRLICPCC</div>
            </p>
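            <p>
                The <code>index(label)</code> notation used in the triplet examples above can be split into its parts with a short parser. The following is only a minimal Python sketch: <code>Triplet</code> and <code>parse_triplet</code> are hypothetical names introduced here for illustration, and the input is assumed to look like the examples displayed on this page, which may differ from the layout of the files in the downloaded archive.
            </p>
<pre>
```python
import re
from typing import NamedTuple

# One "index(label)" token as displayed in the examples, e.g. "117(GO:0000166)".
# Assumption: labels never contain parentheses.
TOKEN = re.compile(r"(\d+)\(([^()]*)\)")


class Triplet(NamedTuple):
    """A (head, relation, tail) triplet with numeric indices and labels."""
    head_id: int
    head_label: str
    relation_id: int
    relation_label: str
    tail_id: int
    tail_label: str


def parse_triplet(line: str):
    """Parse a line such as '0(GO:0000001) 0(is_a) 23558(GO:0048308)'.

    Returns a Triplet; raises ValueError unless exactly three tokens are found.
    """
    tokens = TOKEN.findall(line)
    if len(tokens) != 3:
        raise ValueError(f"expected 3 'index(label)' tokens, found {len(tokens)}")
    (h, h_label), (r, r_label), (t, t_label) = tokens
    return Triplet(int(h), h_label, int(r), r_label, int(t), t_label)


# Usage with the Protein-GO example shown on this page.
example = parse_triplet(
    "71090(Q14028)   36(enables_nucleotide_binding)   117(GO:0000166)"
)
print(example.head_label, example.relation_label, example.tail_label)
```
</pre>
            <p>
                Any amount of whitespace between the three tokens is accepted, so the same function handles both the Protein-GO and the GO-GO example.
            </p>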


            <h2>Publication</h2>
            <hr />
            <ul>
  
                  <li>
                    <b>OntoProtein: Protein Pretraining With Gene Ontology Embedding</b>
                    <br>
                    
                    Ningyu Zhang,
                    
                    Zhen Bi,
                    
                    Xiaozhuan Liang,
                    
                    Siyuan Cheng,
                    
                    Haosen Hong,
                    
                    Shumin Deng,

                    Qiang Zhang,

                    Jiazhang Lian,

                    Huajun Chen
                    
                    <br>
                    <b class="text-red" style="padding-right: 10px">ICLR 2022</b>
                    
                    <a href="https://arxiv.org/pdf/2201.11147.pdf" style="padding-right: 10px" id="none">arXiv</a>
                    
                  </li>
                  
                </ul>
                <h2>Cite</h2>
                <hr />
                <div style="background-color: #FBEFF2; height: 390px; padding-top: 3px; border-radius: 15px; padding-left: 30px; font-size: 15px; font-style:italic;">
@inproceedings{<br>
zhang2022ontoprotein,<br>
title={OntoProtein: Protein Pretraining With Gene Ontology Embedding},<br>
author={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},<br>
booktitle={International Conference on Learning Representations},<br>
year={2022},<br>
url={https://openreview.net/forum?id=yfe1VMYAXa4}<br>
}
                </div>
                <div class="footer">
                    AZFT &amp; ZJUNLP
                </div>


        </div>
        
    </body>
</html>
